ABSTRACT

Recent research has taken advantage of Wikipedia's
multilingualism as a resource for cross-language information
retrieval and machine translation, and has proposed
techniques for enriching its cross-language structure. The
availability of documents in multiple languages also opens up
new opportunities for querying structured Wikipedia content
and, in particular, for enabling answers that straddle
different languages. As a step towards supporting such queries,
in this paper, we propose a method for identifying mappings 
between attributes from infoboxes that come from pages 
in different languages. Our approach finds mappings in a 
completely automated fashion. Because it does not require 
training data, it is scalable: not only can it be used to find 
mappings between many language pairs, but it is also ef- 
fective for languages that are under-represented and lack 
sufficient training samples. Another important benefit of 
our approach is that it does not depend on syntactic simi- 
larity between attribute names, and thus, it can be applied 
to language pairs that have distinct morphologies. We have 
performed an extensive experimental evaluation using a cor- 
pus consisting of pages in Portuguese, Vietnamese, and En- 
glish. The results show that not only does our approach 
obtain high precision and recall, but it also outperforms 
state-of-the-art techniques. We also present a case study 
which demonstrates that the multilingual mappings we de- 
rive lead to substantial improvements in answer quality and 
coverage for structured queries over Wikipedia content. 

1. INTRODUCTION 

With over 17.9 million articles and 10 million page views 
per month [38], Wikipedia has become a popular and im- 
portant source of information. One of its most remarkable 
aspects is multilingualism: there are Wikipedia articles in 
over 270 languages. This opens up new opportunities for 
knowledge sharing among people who speak different
languages, both within and outside the scope of Wikipedia. For
example, cross-language links, which connect an article in one
language to the corresponding article in another, have been
used to derive better translations in cross-language
information retrieval and machine translation [11, 24, 30, 32].

But even though many languages are represented in Wiki- 
pedia, the geographical distribution of Wikipedia users is 
highly skewed. One of the explanations for this effect is 
that many languages, including languages spoken by large 
segments of the world population, are under-represented. 
For example, there are 328 million English speakers world- 
wide and 20% of the Wikipedia pages are in English; in con- 
trast, there are 178 million Portuguese speakers and only 
3.75% of Wikipedia articles are in Portuguese. Recognizing 
this problem, there are a number of ongoing efforts which 
aim to improve access to Wikipedia content. By leveraging 
the existing multilingual Wikipedia corpus, techniques have 
been proposed to: combine content provided in documents 
from different languages and thereby improve both docu- 
ments [1, 5]; find missing cross-language links [29, 33]; aid 
in the creation of multilingual content [19]; and help users 
who speak different languages to search for named entities 
in the English Wikipedia [35].

Besides textual content, Wikipedia has also become a 
prominent source for structured information. A growing 
number of articles contain an infobox that provides a struc- 
tured record for the entity described in the article. This has 
enabled richer queries over Wikipedia content (see e.g., [2, 
17, 25]). While much work has been devoted to supporting 
structured queries, no previous effort has looked into pro- 
viding support for multilingual structured queries. In this 
paper, we examine the problem of matching schemas of in- 
foboxes represented in different languages, a necessary step 
for supporting these queries. 

By discovering multilingual attribute correspondences, it 
is possible to integrate information from different languages 
and to provide more complete answers to user queries. A 
common scenario is when the answer to a query cannot be 
found in a given language but it is available in another. In a 
study of the 50 topics used in the GikiCLEF campaign [13], 
just nine topics had answers in all ten languages used in 
the task [6]. However, almost every query had an answer in 
the English Wikipedia. Thus, by supporting multi-language 
queries and providing the relevant English documents as 
part of the answer, recall can be improved for most other lan- 
guages. In addition, some queries can benefit from integrat- 
ing information present in multiple infoboxes represented in 
different languages. Consider the query Find the genre and 
the studio that produced the film "The Last Emperor". To 
provide a complete answer to this query, we need to combine 
the information from the two infoboxes in Figure 1. 

There are several challenges involved in finding
multilingual correspondences across infoboxes. Even within a
language, finding attribute correspondences is difficult.
Although authors are encouraged to provide structure in
Wikipedia articles, e.g., by selecting appropriate templates and
categories, they often do not follow the guidelines or follow 
them loosely. This leads to several problems, in particular
schema drift: the structure of infoboxes for the same
entity type (e.g., actor, country) can differ across
instances. Both polysemy and synonymy are observed among
attribute names: a given name can have different
semantics (e.g., born can mean birth date or country of birth),
and different names can have the same meaning (e.g., alias
and other names). This problem is compounded when we
consider multiple languages. Figure 1 shows an example of 
heterogeneity in infoboxes describing the same entity in dif- 
ferent languages. Some attributes in the English infobox 
do not have a counterpart in the Portuguese infobox and 
vice-versa. For instance: produced by, editing by, distributed 
by, and budget are omitted in the Portuguese version, while 
genero (genre) is omitted in the English version. An anal- 
ysis of the overlap among attribute sets from infoboxes in 
English and Portuguese (see Table 5) shows that on aver- 
age only 42% of the attributes are present in both languages. 
Besides the variation in structure, there are also inconsisten- 
cies in the attribute values, for example: running time is 160 
minutes in the English version and 165 minutes in the Por- 
tuguese version; Ryuichi Sakamoto appears under Music by 
in English and under Elenco original (cast) in Portuguese. 

To identify multilingual matches, a possible strategy would 
be to translate the attribute names and values using a mul- 
tilingual dictionary or a machine translation system, and 
then apply traditional schema or ontology matching tech- 
niques [31, 10, 12]. However, this strategy is limited since, in 
many cases, the correct correspondence is not found among 
the translations. For example, in articles describing movies, 
the correct alignment for the English attribute starring is 
the Portuguese attribute elenco original. However, the dic- 
tionary translation is estrelando for the former and original 
cast for the latter, and neither is used in the Wikipedia in- 
fobox templates to name an attribute. WordNet is another 
source of synonyms that can potentially help in matching, 
but its versions in many languages are incomplete. For in- 
stance, the Vietnamese WordNet [36] covers only 10% of the 
senses present in the English WordNet. Furthermore, tradi- 
tional techniques such as string similarity may fail even for 
languages that share words with similar roots. Consider the 
term editora, which in Portuguese means publisher. Using 
string similarity, it would be very close to editor, but this 
would be a false cognate. 

Recently, techniques have been proposed to identify mul- 
tilingual attribute alignments for Wikipedia infoboxes. But 
these have important shortcomings in that they are designed 
for languages that share similar words [1, 5], or demand a 
considerable amount of training data [1]. Consequently, they 
cannot be effectively applied to languages with distinct rep- 
resentations or different roots; and their applicability is also 
limited for under-represented languages in Wikipedia, which 
have few pages and thus, insufficient training data. 

Figure 1: Excerpts from English and Portuguese
infoboxes for the film The Last Emperor.

Contributions. We propose WikiMatch, a new approach
to multilingual schema matching that addresses these
limitations. WikiMatch gathers similarity evidence from
multiple sources: attribute values, link structure, co-occurrence
statistics within and across languages, and an automatically
derived bilingual dictionary. These different sources
of similarity information are combined in a systematic man- 
ner: the alignment algorithm prioritizes the derivation of 
high-confidence correspondences and then uses these to find 
additional ones. By doing so, it is able to obtain both 
high precision and recall. The algorithm finds, in a single 
step, inter- and intra-language correspondences, as well as 
complex, one-to-many correspondences. Because WikiMatch 
does not require training data, it is able to handle under- 
represented languages; and since it does not rely on string 
similarity on attribute names, it can be applied both to sim- 
ilar and morphologically distinct languages. Furthermore, it 
does not require external resources, such as bilingual dictio- 
naries, thesauri, ontologies, or automatic translators. 

We present a detailed experimental evaluation using in- 
foboxes in Portuguese, Vietnamese, and English. We also 
compare WikiMatch to state-of-the-art techniques from data 
integration [3] and information retrieval [20], as well as to a
technique specifically designed to align infobox attributes [5].
The results show that WikiMatch consistently outperforms
existing approaches in terms of F-measure, and in
particular, it obtains substantially higher recall. We also present a
case study where we show that, through the use of the cor- 
respondences derived by WikiMatch, a multilingual querying 
system is able to derive higher-quality answers. 

2. PROBLEM DEFINITION 

A Wikipedia article is associated with and describes an 
entity (or object). Let A be an article in language L associ- 
ated with entity E. Among the different components of A, 
here, we are interested in its title; infobox, which consists 
of a structured record that summarizes important informa- 
tion about E; and cross-language links, URLs of pages in 
languages other than L that describe E. An infobox I
contains a set of attribute-value pairs $\{(a_1, v_1), \ldots, (a_n, v_n)\}$.
Figure 1(a) shows the infobox of an English article with 14
attribute-value pairs. Since there is a one-to-one
relationship between I and its associated E, we use these terms
interchangeably in the remainder of the paper. We define
the set of attributes in an infobox I as the schema of I ($S_I$).

The value v of an attribute a in an infobox I may contain
one or more hyperlinks to other Wikipedia entities. For
example, in Figure 1(a), the value for the attribute Directed by
contains a hyperlink to the entity Bernardo Bertolucci. We
denote such a hyperlink by the tuple $h = (I, v, J)$, where
J is the infobox pointed to by v. We distinguish between
hyperlinks that point to another entity in the same language
(which define relationships) and hyperlinks that point to
articles describing the same entity in different languages. We
refer to the latter as cross-language links. We denote by
$cl = (I_L, I_{L'})$ a link between the documents in languages L
and L' that represent the same entity. These links can be
found in most articles and are located on the pane to the
left of the article.

An article is also associated with an entity type T. For 
example, the article in Figure 1(a) corresponds to the type 
"Film". There are different ways to determine the entity 
type for an article, including from the categories defined 
for the article; from the template defined for the infobox; 
or from the structure of the infobox. Given a set $X_L$ of
infoboxes in language L associated with entity type T, we
refer to the set of all distinct attributes in $X_L$ as the schema
of T ($S_T$). Given two infoboxes $I_L$ and $I_{L'}$ with type T
that are connected by a cross-language link, we refer to the
union of the attributes in their schemas, $S_D = S_I \cup S_{I'}$, as a
dual-language infobox schema. The problem we address can
be stated as follows. We are given two sets of infoboxes $X_L$ and $X_{L'}$
in languages L and L', respectively, such that both sets are
associated with the entity type T and the infoboxes in the
sets are connected through cross-language links. To match
$S_T$ and $S'_T$, the schemas of infoboxes in the two sets, we
need to find correspondences (or matches) $\langle a, a' \rangle$ such that
a is an attribute of $S_I$, a' is an attribute of $S_{I'}$, and a and
a' have the same meaning.
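
To make these definitions concrete, the following minimal sketch models an infobox as a mapping from attribute names to values and builds a dual-language schema as the union of two linked schemas. The dict-based representation and the helper names `schema` and `dual_language_schema` are assumptions made for illustration; they are not part of WikiMatch.

```python
# Illustrative sketch only: the dict-based infobox model and the helper
# names are assumptions, not part of WikiMatch.

def schema(infobox: dict) -> set:
    """S_I: the set of attribute names of one infobox."""
    return set(infobox)


def dual_language_schema(infobox_l: dict, infobox_l2: dict) -> set:
    """S_D: union of the schemas of two infoboxes joined by a cross-language link."""
    return schema(infobox_l) | schema(infobox_l2)


# Attribute names and values taken from the discussion of Figure 1.
english = {"directed by": "Bernardo Bertolucci", "music by": "Ryuichi Sakamoto"}
portuguese = {"direção": "Bernardo Bertolucci", "elenco original": "Ryuichi Sakamoto"}

s_d = dual_language_schema(english, portuguese)
```

Here `s_d` holds all four attribute names; the matching problem is to decide which of them have the same meaning.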

3. THE WIKIMATCH APPROACH 

WikiMatch works in three steps. First, it identifies map- 
pings between entity types in different languages, e.g., it 
determines that type "Film" in English corresponds to type 
"Filme" in Portuguese. It then computes, for each type, the 
similarity for all attribute pairs within and across languages. 
To do so, it leverages information available in Wikipedia, 
including: attribute values, link structure of articles, cross- 
language links, and an automatically-derived bilingual dic- 
tionary. As another source of similarity, WikiMatch uses La- 
tent Semantic Indexing (LSI) [7] as a correlation measure. 
Because WikiMatch does not rely on string similarity func- 
tions for attribute names, it is effective even for languages 
that do not share words with similar roots. 

Even though it is useful to consider multiple similarity 
sources, an important challenge that ensues is how to com- 
bine them. While searching for attribute correspondences, 
WikiMatch incrementally combines the different sources, and 
selects the high-confidence matches first, in an attempt to
avoid error propagation to subsequent matches. As the last 
step, to improve recall, the derived correspondences are used 
to help identify additional correspondences for attributes 
that remain unmatched. 



3.1 Matching Entity Types across Languages 

There are different mechanisms to associate entities with 
types, including the assignment of categories to articles and 
template types to infoboxes. It is also possible to cluster 
the infoboxes and infer types based on their structure [26]. 
Regardless of the mechanism used, in Wikipedia, the en- 
tity type system is different for different languages, thus 
an important task is to identify the mappings between the 
types. WikiMatch adopts a simple approach that leverages 
the cross-language links. The intuition is that if
infoboxes belonging to entity type T often link (through a
cross-language link) to infoboxes in a different language of
type T', then it is likely that types T and T' are equivalent.
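
The intuition above can be sketched as a simple voting scheme over cross-language links. The text does not specify how "often" is quantified, so the `min_share` cutoff, the function name, and the toy article identifiers below are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def match_entity_types(links, type_l, type_l2, min_share=0.5):
    """Map each type T in language L to the type T' in L' it links to most often.

    links: iterable of (article_in_L, article_in_L2) cross-language links.
    type_l, type_l2: article -> entity type, one dict per language.
    min_share: assumed cutoff on the fraction of T's links that reach T'.
    """
    votes = defaultdict(Counter)
    for article, counterpart in links:
        if article in type_l and counterpart in type_l2:
            votes[type_l[article]][type_l2[counterpart]] += 1

    mapping = {}
    for t, counter in votes.items():
        best, count = counter.most_common(1)[0]
        if count / sum(counter.values()) >= min_share:
            mapping[t] = best
    return mapping


# Toy data with hypothetical article identifiers.
links = [("en_1", "pt_1"), ("en_2", "pt_2"), ("en_3", "pt_3")]
type_en = {"en_1": "Film", "en_2": "Film", "en_3": "Film"}
type_pt = {"pt_1": "Filme", "pt_2": "Filme", "pt_3": "Ator"}

type_mapping = match_entity_types(links, type_en, type_pt)
```

In this toy input two of the three links from Film pages reach Filme pages, so Film is mapped to Filme.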

3.2 Computing Cross-Language Similarities 

Given two schemas $S_T$ and $S'_T$ for a type T, in languages
L and L' respectively, our goal is to identify correspondences
between attributes in these schemas (Section 2). To
determine if a pair of attributes $\langle a, a' \rangle$, where $a \in S_T$ and
$a' \in S'_T$, forms a correspondence, we compute the similarity
between a and a' by combining different sources of
information, notably: value similarity, attribute-name correlation,
and cross-language link structure.

Cross-Language Value Similarity. Because of the
structural heterogeneity among infoboxes in different languages
(see Appendix A), we combine their attributes in a unified
schema for each distinct type, which gathers more evidence to
help in the derivation of correspondences. We also collect,
for each attribute a in an entity schema $S_T$, the set of
values v associated with a in all infoboxes with type T. Value
similarity for two attributes is then computed as the cosine
similarity between their value vectors.

Since a concept can have different representations across 
languages, direct comparison between vectors often leads to 
low similarity scores. Thus, we use an automatically created 
translation dictionary to help improve the accuracy of the 
similarity score: whenever possible, the values are translated 
into the same language before their similarity is computed. 
Similar to Oh et al. [29], we exploit the cross-language links
among articles in different languages to create a dictionary
for their titles. The translation dictionary from a language
L to language L' is built as follows. For each article A in
L with a cross-language link to article A' in L', we add an
entry to the dictionary that translates the title of article A
to the title of article A'.

Given an attribute a with value vector $v_a$ in language L,
an attribute a' with value vector $v_{a'}$ in language L', and a
translation dictionary D, we construct the translated value
vector of a as follows: if a value of $v_a$ can be found in D, we
replace it by its representation in L'. We denote the
translated value vector of a as $v_a^t$, and define the value similarity
between a and a' as $vsim(a, a') = \cos(v_a^t, v_{a'})$, where the
vector components are the raw frequencies (tf).
Example 1. Consider the following vectors for nascimento and born,
respectively: $v_a$ = {1963, Irlanda:1, 18 de Dezembro 1950:1,
Estados Unidos:1} and $v_{a'}$ = {1963, Ireland:1, June 4 1975:1,
United States:2}, where the numbers after the colons
indicate the frequency of each value. Translating $v_a$, we get
$v_a^t$ = {1963, Ireland:1, December 18 1950:1, United States:1}.
Thus, $vsim(a, a') = \cos(v_a^t, v_{a'}) = 0.71$. ■
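
The dictionary construction and the translated cosine can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, value vectors are modeled as `Counter` objects of raw frequencies, and the toy vectors are not the ones from Example 1.

```python
import math
from collections import Counter


def build_title_dictionary(cross_language_links):
    """Each (title_in_L, title_in_L2) link becomes one dictionary entry."""
    return dict(cross_language_links)


def translate(values: Counter, dictionary: dict) -> Counter:
    """Replace every value found in the dictionary by its L' representation."""
    translated = Counter()
    for value, freq in values.items():
        translated[dictionary.get(value, value)] += freq
    return translated


def cosine(u: Counter, v: Counter) -> float:
    # A Counter returns 0 for missing keys, so unmatched values add nothing.
    dot = sum(freq * v[key] for key, freq in u.items())
    norm = (math.sqrt(sum(f * f for f in u.values()))
            * math.sqrt(sum(f * f for f in v.values())))
    return dot / norm if norm else 0.0


def vsim(values_a: Counter, values_a2: Counter, dictionary: dict) -> float:
    """Cosine between the translated vector of a and the vector of a'."""
    return cosine(translate(values_a, dictionary), values_a2)


# Toy data (hypothetical frequencies).
dictionary = build_title_dictionary([("Irlanda", "Ireland"),
                                     ("Estados Unidos", "United States")])
v_pt = Counter({"Irlanda": 1, "Estados Unidos": 1})
v_en = Counter({"Ireland": 1, "United States": 1})

untranslated = cosine(v_pt, v_en)
translated = vsim(v_pt, v_en, dictionary)
```

Without translation the two toy vectors share no component and their cosine is zero; after translation they coincide, which is the effect the dictionary is meant to achieve.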

Figure 2: Some attributes for Actor in Pt-En.

Link Structure Similarity. Attribute values in an infobox
often link to other articles in Wikipedia. For example,
attribute Directed by in Figure 1(a) has the value Bernardo
Bertolucci, which links to an article for this director in
English. Similarly, the value of attribute Direção in Figure 1(b)
links to an article for this director in Portuguese. Because
of the multilingual nature of Wikipedia, the two articles
for Bernardo Bertolucci are linked by a cross-language link.
Similar to Bouma et al. [5], we leverage this feature as
another source of similarity. In this example, the link structure
information helps us determine that <Directed by, Direção>
is a match. We define the link structure set of an attribute in an
entity type schema S as the set of outgoing links for all of its
values. Given two attributes, the larger the intersection
between their link structures, the more likely they are to form
a correspondence. Two values are considered equal if their
corresponding landing articles are linked by a cross-language
link. Let $ls(a) = \{l_a^i \mid i = 1..n\}$ and $ls(a') = \{l_{a'}^j \mid j = 1..m\}$
be the link structure sets for attributes a and a'. The link
structure similarity lsim between these attributes is
measured as $lsim(a, a') = \cos(ls(a), ls(a'))$.
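
A minimal sketch of this measure follows. It assumes that link targets are weighted by how often they occur and that cross-language links are available as a mapping from L titles to L' titles; the function name and the prefixed article identifiers are hypothetical.

```python
import math
from collections import Counter


def lsim(links_a, links_a2, cross_links):
    """Cosine between the link structure sets of attributes a and a'.

    links_a: target articles (language L) linked from the values of a.
    links_a2: target articles (language L') linked from the values of a'.
    cross_links: article in L -> its cross-language counterpart in L'.
    Two targets count as equal when a cross-language link joins them.
    """
    u = Counter(cross_links.get(target, target) for target in links_a)
    v = Counter(links_a2)
    dot = sum(freq * v[key] for key, freq in u.items())
    norm = (math.sqrt(sum(f * f for f in u.values()))
            * math.sqrt(sum(f * f for f in v.values())))
    return dot / norm if norm else 0.0


# Toy data: identifiers prefixed with the language are hypothetical.
cross_links = {"en:Bernardo Bertolucci": "pt:Bernardo Bertolucci",
               "en:Ryuichi Sakamoto": "pt:Ryuichi Sakamoto"}

same = lsim(["en:Bernardo Bertolucci"], ["pt:Bernardo Bertolucci"], cross_links)
partial = lsim(["en:Bernardo Bertolucci", "en:Ryuichi Sakamoto"],
               ["pt:Bernardo Bertolucci"], cross_links)
```

The first call compares identical link sets, while the second shares only one of two targets and therefore scores lower.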

For attribute values which have links, the difference
between value and link similarity lies in using Wikipedia href
links in two ways: their anchor texts (vsim) and their target
URI article names (lsim). Since attribute values are
heterogeneous (anchor texts referring to the same entity may be
different, e.g., "United States" and "USA") and not all
values have links, both vsim and lsim are necessary.
Attribute Correlation. Correlation has been successfully 
applied in holistic strategies to identify correspondences in 
Web form schema matching [15, 27, 34]. There, the intu- 
ition was that synonyms should not co-occur in a given form 
and therefore, they should be negatively correlated. For a 
given language, the same intuition holds for attributes in an 
infobox — synonyms should not appear together. However, 
for identifying cross-language correspondences, the opposite
is true: if we combine the attribute names for corresponding 
infoboxes across languages creating a dual-language infobox 
schema, cross-language synonyms are likely to co-occur. 

While previous works applied absolute correlation mea- 
sures for all attribute pairs, we use Latent Semantic Indexing 
(LSI) [7]. Our inspiration comes from the CLIR literature, 
where LSI was one of the first methods applied to match 
terms across languages [20]. But while LSI has traditionally
been applied to terms in free text, here we use it to estimate 
the correlation between schema attributes. 



Let $D = \{d_i \mid i = 1..m\}$ be the set of dual-language
infoboxes associated with entity type T, and $A = \{a_j \mid j = 1..n\}$
the set of unique attributes in D. In the occurrence
matrix $M_{n \times m}$ (with n rows and m columns), $M(i, j) = 1$
if attribute $a_i$ appears in dual-language infobox $d_j$, and
$M(i, j) = 0$ otherwise. Each row in the matrix corresponds
to the occurrence pattern of the corresponding attribute over
D. See Figure 2(a) for an example of such a matrix. We
apply the truncated singular value decomposition (SVD) [20]
to derive $\tilde{M} = U_f S_f V_f^T$ by choosing the f most
important dimensions and scaling the attribute vectors by the top
f singular values in matrix S. SVD causes cross-language
synonyms to be represented by similar vectors: if attribute
names are used in similar infoboxes, they will have similar
vectors in the reduced representation. This is what makes
LSI suitable for cross-language matching.
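
The reduction step can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and the toy occurrence matrix (rows for born, nascimento, died, and morte over four dual-language infoboxes) is invented.

```python
import numpy as np


def lsi_vectors(occurrence: np.ndarray, f: int) -> np.ndarray:
    """Return one f-dimensional vector per attribute (row of the matrix).

    occurrence: n x m binary matrix, attributes by dual-language infoboxes.
    The rows of U_f are scaled by the top f singular values.
    """
    u, s, _vt = np.linalg.svd(occurrence, full_matrices=False)
    f = min(f, len(s))
    return u[:, :f] * s[:f]


def cosine(x: np.ndarray, y: np.ndarray) -> float:
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0


# Hypothetical occurrence patterns: rows are born, nascimento, died, morte.
occurrence = np.array([[1, 1, 1, 0],
                       [1, 1, 1, 0],
                       [0, 0, 1, 1],
                       [0, 0, 1, 1]], dtype=float)

vectors = lsi_vectors(occurrence, f=2)
born_nascimento = cosine(vectors[0], vectors[1])
born_died = cosine(vectors[0], vectors[2])
```

Attributes with the same occurrence pattern, such as born and nascimento in this toy matrix, end up with the same reduced vector, while born and died remain clearly apart.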

To measure the correlation between attributes in different
languages, we compute the cosine between their vectors. For
attributes in the same language, we take the complement of
the cosine between their vectors, and if the attributes
co-occur in an infobox, we set the LSI score to 0 as they are
unlikely to be synonyms. Thus, in WikiMatch, the LSI score
for attributes $a_p$ and $a_q$ is computed as:

$$
LSI(a_p, a_q) =
\begin{cases}
\mathrm{cosine}(\vec{a}_p, \vec{a}_q) & \text{if } a_p \text{ in } L \wedge a_q \text{ in } L' \\
0 & \text{if } a_p, a_q \text{ in } I_L \text{ or } I_{L'} \\
1 - \mathrm{cosine}(\vec{a}_p, \vec{a}_q) & \text{if } a_p \wedge a_q \text{ in } L \text{ or } L'
\end{cases}
$$

For attributes in the same language, an LSI score of 1 means
they never co-occur in a dual-language infobox.
Consequently, they are likely to be intra-language synonyms. In
contrast, for attributes in different languages, an LSI score of
1 means they co-occur in every dual-language infobox. Thus,
they have a good chance of being cross-language synonyms.
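
The three cases of the LSI score can be written as a small function. In this sketch the reduced attribute vectors are passed in as plain lists, and the argument names (in particular the `cooccur` flag, which says whether two same-language attributes appear together in some infobox) are assumptions made for illustration.

```python
import math


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def lsi_score(vec_p, vec_q, lang_p, lang_q, cooccur):
    """LSI score for attributes a_p and a_q given their reduced vectors.

    cooccur: True when both attributes appear in the same infobox of one
    language, in which case they are unlikely to be synonyms.
    """
    if lang_p != lang_q:
        return cosine(vec_p, vec_q)       # cross-language: plain cosine
    if cooccur:
        return 0.0                        # same infobox: not synonyms
    return 1.0 - cosine(vec_p, vec_q)     # same language: complement


cross = lsi_score([1.0, 0.0], [1.0, 0.0], "en", "pt", cooccur=False)
same_box = lsi_score([1.0, 0.0], [1.0, 0.0], "pt", "pt", cooccur=True)
intra = lsi_score([1.0, 0.0], [0.0, 1.0], "pt", "pt", cooccur=False)
```

The co-occurrence check is applied before the complement, so same-language attributes that share an infobox never receive a positive score.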

Note that, as illustrated in Figure 1, corresponding
infoboxes are not parallel, i.e., there is not a one-to-one
mapping between attributes in the two languages. As a
consequence, LSI is expected to yield uncertain results for
cross-language synonyms. And when rare attributes are present,
the same outcome will be observed for intra-language
synonyms. As we discuss in Section 4, when used in isolation,
LSI is not a reliable method for cross-language attribute
alignment. However, if combined with the other sources of
similarity, it contributes to high recall and precision.

Advantages of using LSI for finding cross-language syn- 
onyms include: (i) all attribute names are transformed into 
a language-independent representation, thus there is no need 
for translation; (ii) external resources such as dictionaries, 
thesauri, or automatic translators are not required; (iii) lan- 
guages need not share similar words; and (iv) LSI can im- 
plicitly capture higher order term co-occurrence [18]. 

We have examined other alternatives for computing at- 
tribute correlations, including the measures used in [15, 27, 
34]. However, since these were defined to identify synonyms 
within one language, they cannot be directly applied to 
our problem. We have also extended them to consider co- 
occurrence frequency in the dual infoboxes, but as we discuss 
in Appendix B, LSI outperforms all of them. This can be 
explained in part by the dimensionality reduction achieved 
by SVD and the consideration of the co-occurrence patterns 
of LSI for attribute pairs over all dual-language infoboxes. 

3.3 Deriving Correspondences 

The effectiveness of any given similarity measure varies 
for different attributes and entity types. For example, two 
attributes may have different values and yet be synonyms, or
vice versa. Thus, to derive correspondences, an important 
challenge is how to combine the similarity measures. We 
propose an AttributeAlignment algorithm (Algorithm 1) 
which combines different similarity measures in such a way 
that they reinforce each other. Given as input the set of all 
attributes for infoboxes that belong to a given type, it groups 
together attributes that have the same label, and for these, 
combines their values — we refer to the set of such groups as 
AG. The attribute groups in AG are then paired together, 
and for each pair, the similarity measures are computed 
(Section 3.2). This step creates a set of tuples that
associate similarity values with each attribute pair: ($\langle a_p, a_q \rangle$,
vsim, lsim, LSI). The tuples with an LSI score greater than
a threshold $T_{LSI}$ are then added to a priority queue P.
Intuitively, a pair of matching attributes should have a high
positive correlation. However, due to the heterogeneity in
the data, this correlation may be weak, so $T_{LSI}$ should be
set to a low value.

The tuples in P are sorted in decreasing order of LSI score.
The goal is to prioritize matches that are more likely to be
correct and avoid the early selection of incorrect matches,
which can result in error propagation to future matches. The
similarities for a pair of attributes $a_p$, $a_q$ are combined as
follows: if $\max(vsim(a_p, a_q), lsim(a_p, a_q)) > T_{sim}$, then
$\langle a_p, a_q \rangle$ is a certain candidate correspondence. The intuition
is that two attributes form a certain correspondence if they
are correlated and this is corroborated by at least one of the
other similarity measures. So that certain correspondences
are selected early, $T_{sim}$ is set to a high value.

One potential drawback of WikiMatch is that it requires 
these two thresholds to be set. We have studied the behavior 
of WikiMatch using different thresholds, and as we discuss 
in Appendix B, our approach remains effective and obtains 
high F-measure for a broad range of threshold values. 

Figure 2(a) shows a subset of the attributes in English 
and Portuguese for the type Actor. The cells in this matrix 
contain the number of occurrences for an attribute in each 
dual-language infobox. The matches in the ground truth 
are indicated by the arrows. Notice that died matches two 
attributes in Portuguese. Figure 2(b) shows some of the at- 
tribute pairs in P, with their similarity scores. For example, 
the pair <born, nascimento> is a certain match because all 
similarity scores are high. 

If a candidate correspondence $\langle a_p, a_q \rangle$ does not satisfy
the constraint in line 10 (Algorithm 1), it is added to the
set of uncertain matches U (line 13) to be considered later
(Section 3.4). Otherwise, if it does satisfy the constraint, it
is given as input to IntegrateMatches (Algorithm 2), which
decides whether it will be integrated into an existing match,
originate a new one, or be ignored. IntegrateMatches
outputs a set of matches, M, where each match
$m = \{a_1 \sim \ldots \sim a_m\}$ includes a set of synonyms, both within and across
languages. IntegrateMatches takes advantage of the
correlations among attributes to determine how to integrate the
new correspondence into the set of existing matches. If
neither of the attributes in the new correspondence appears in
the existing matches M, a new matching component is
created (line 5). If at least one of the attributes is already in a
match $m_j$ in M, e.g., suppose $a_p$ appears in $m_j$, and the LSI
score between $a_q$ and all attributes $a_j$ in $m_j$ is greater than
the correlation threshold $T_{LSI}$ (line 8), then $a_q$ becomes a
new element in $m_j$ (line 9, where $+ (\sim \{a_q\})$ denotes that $a_q$
is added to the existing match $m_j$); otherwise, it is ignored.
The idea is to test for positive correlations between all
attributes of a match to see whether it is possible to integrate
the attributes in question into the existing matches. Since
$T_{LSI}$ is set low, the requirement of having positive
correlations with all attributes in an existing match is not too
strict and helps merge intra- and inter-language synonyms.
We should note, however, that by relaxing this constraint
(e.g., to include only some of the attributes), it is possible
to increase recall at the cost of lower precision.

IntegrateMatches is based on the algorithm used by Su
et al. [34] to construct groups of Web form attributes.
However, our experiments (Section 4.2) show that attribute
correlation alone is not sufficient to obtain high F-measure
scores. Further, since our correlation measures work for
attribute pairs both within and across languages, as illustrated
in the example below, IntegrateMatches can discover both
intra- and cross-language synonyms.

Example 2. Consider the attribute pairs in Figure 2(b) for
type Actor, ordered by descending LSI scores, with $T_{LSI} = 0.1$.
Assume that the set of existing matches M includes m =
{died ~ falecimento}, and we have two candidate pairs,
$p_1$ = <died, morte> and $p_2$ = <died, nascimento>. Since
the LSI score for morte and falecimento is greater than $T_{LSI}$,
morte is integrated into m, i.e., m = {died ~ falecimento
~ morte}. In contrast, $p_2$ is not added to m since the LSI
score for falecimento and nascimento is zero, as they are in
the same language and co-occur often. ■

Algorithm 1 AttributeAlignment

 1: Input: Set of infobox attributes for an entity type T
 2: Output: Set of matches M
 3: begin
 4:   $M \leftarrow \emptyset$, $P \leftarrow \emptyset$
 5:   for each pair $\langle a_p, a_q \rangle$ such that $a_p, a_q \in AG$ do
 6:     Compute vsim, lsim, LSI
 7:     $P \leftarrow P \cup (\langle a_p, a_q \rangle, vsim, lsim, LSI) \mid LSI > T_{LSI}$
 8:   while $P \neq \emptyset$ do
 9:     Choose pair $\langle a_p, a_q \rangle$ with the highest LSI score from P
10:     if $\max(vsim(a_p, a_q), lsim(a_p, a_q)) > T_{sim}$ then
11:       $M \leftarrow$ IntegrateMatches($\langle a_p, a_q \rangle$, M)
12:     else
13:       $U \leftarrow U \cup \{\langle a_p, a_q \rangle\}$ /* buffering uncertain matches */
14:     Remove $\langle a_p, a_q \rangle$ from P
15:   $U' \leftarrow$ ReviseUncertain(U)
16:   for each $u \in U'$ do
17:     $M \leftarrow$ IntegrateMatches(u, M)
18: end
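
The control flow of Algorithm 1 can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the similarity scores are assumed to be precomputed, the `integrate` and `revise` callables stand in for IntegrateMatches and ReviseUncertain, and the scores in the usage example are hypothetical.

```python
import heapq


def attribute_alignment(pairs, t_lsi, t_sim, integrate, revise):
    """Process candidate pairs in decreasing order of LSI score.

    pairs: iterable of (a_p, a_q, vsim, lsim, lsi) tuples.
    integrate(pair, matches) -> matches; revise(uncertain) -> list of pairs.
    """
    # heapq is a min-heap, so LSI scores are negated; the index breaks ties.
    queue = [(-lsi, i, (a_p, a_q), max(vs, ls))
             for i, (a_p, a_q, vs, ls, lsi) in enumerate(pairs)
             if lsi > t_lsi]
    heapq.heapify(queue)

    matches, uncertain = [], []
    while queue:
        _, _, pair, best_sim = heapq.heappop(queue)
        if best_sim > t_sim:
            matches = integrate(pair, matches)   # certain candidate
        else:
            uncertain.append(pair)               # buffered for revision
    for pair in revise(uncertain):
        matches = integrate(pair, matches)
    return matches


# Usage with placeholder callables and hypothetical scores.
pairs = [("born", "nascimento", 0.9, 0.8, 0.7),
         ("died", "morte", 0.2, 0.1, 0.6),
         ("born", "morte", 0.9, 0.9, 0.05)]
result = attribute_alignment(
    pairs, t_lsi=0.1, t_sim=0.6,
    integrate=lambda pair, matches: matches + [set(pair)],
    revise=lambda uncertain: [])
```

With these inputs the third pair is filtered out by the LSI threshold, the second is buffered as uncertain, and only the first is integrated as a certain match.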

Algorithm 2 IntegrateMatches

 1: Input: candidate pair $\langle a_p, a_q \rangle$, set of current matches M
 2: Output: updated set of matches M
 3: begin
 4:   if neither $a_p$ nor $a_q \in M$ then
 5:     $M \leftarrow M + \{a_p \sim a_q\}$
 6:   else if either $a_p$ or $a_q$ appears in M
 7:     /* suppose $a_p$ appears in $m_j$ and $a_q$ does not appear */
 8:     for each $a_j \in m_j$, s.t. $LSI_{qj} > T_{LSI}$ do
 9:       $m_j \leftarrow m_j + (\sim \{a_q\})$
10: end
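
A possible Python rendering of Algorithm 2 is given below, following the prose description that the new attribute must exceed the threshold against all attributes of the existing match. Matches are modeled as sets, the `lsi` callable and the score table are hypothetical, and ignoring a pair whose attributes are both already matched is an assumption.

```python
def integrate_matches(pair, matches, lsi, t_lsi):
    """Integrate a candidate pair into a list of matches (sets of synonyms).

    lsi(a, b) returns the LSI score of two attributes.
    """
    a_p, a_q = pair
    home_p = next((m for m in matches if a_p in m), None)
    home_q = next((m for m in matches if a_q in m), None)

    if home_p is None and home_q is None:
        matches.append({a_p, a_q})                    # new matching component
    elif home_q is None:
        if all(lsi(a_q, a_j) > t_lsi for a_j in home_p):
            home_p.add(a_q)                           # a_q joins a_p's match
    elif home_p is None:
        if all(lsi(a_p, a_j) > t_lsi for a_j in home_q):
            home_q.add(a_p)                           # a_p joins a_q's match
    # Assumption: if both attributes are already matched, the pair is ignored.
    return matches


# Scores are hypothetical, except that falecimento and nascimento score zero
# as in Example 2.
scores = {frozenset(("died", "morte")): 0.6,
          frozenset(("morte", "falecimento")): 0.5,
          frozenset(("died", "nascimento")): 0.3,
          frozenset(("falecimento", "nascimento")): 0.0}


def lsi(a, b):
    return scores.get(frozenset((a, b)), 0.0)


matches = [{"died", "falecimento"}]
matches = integrate_matches(("died", "morte"), matches, lsi, t_lsi=0.1)
matches = integrate_matches(("died", "nascimento"), matches, lsi, t_lsi=0.1)
```

As in Example 2, morte is merged into the existing match, whereas nascimento is rejected because its LSI score with falecimento is zero.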



3.4 Revising Uncertain Matches 

Since our alignment algorithm prioritizes high-confidence
correspondences, it may miss correspondences that are
correct but that have low confidence: the uncertain matches.
Consider, for example, value similarity. While born and
morte (died) are not equivalent, their similarity is high since
they share many values and links: both attributes have
values that correspond to dates and places. On the other
hand, although outros nomes and other names are
equivalent, their value similarity is low as they do not share
values or links. Consequently, even though high value
similarity provides useful evidence for deriving attribute
correspondences, it may also prevent correct matches from being
identified. The ReviseUncertain step uses the set M of
matches derived by AttributeAlignment (line 15) to
identify additional matches, by reinforcing or negating the
uncertain candidates (in set U). A challenge in this step is
how to balance the potential gain in recall with a potential
loss in precision. Our solution to this problem is to consider
only the subset U' of attribute pairs in U whose attributes
are highly correlated with the existing matches. To capture
this, we introduce the notion of inductive grouping score.
Let <a, a'> be an uncertain correspondence in U, and let
$C_a$ and $C_{a'}$ be the sets of matched attributes co-occurring
with a and a', respectively, in their monolingual schemas.
The inductive grouping score between a and a' is the average
grouping score of a and a' with each attribute in $C_a$ and
$C_{a'}$:

$$\tilde{g}(a, a') = \frac{1}{|C_a|\,|C_{a'}|} \sum_{c_a \in C_a,\; c'_a \in C_{a'} \,\mid\, c_a \sim c'_a} g(a, c_a) \cdot g(a', c'_a)$$

where the grouping score g is computed as follows:

$$g(a_p, a_q) = \frac{O_{pq}}{\min(O_p, O_q)}$$

$O_p$ and $O_q$ are the number of occurrences of attributes $a_p$
and $a_q$, and $O_{pq}$ is the number of times they co-occur in the
set of infoboxes. Note that the grouping score is computed
for the schemas of the two languages separately. The inductive
grouping score is high if $a_p$ and $a_q$ co-occur often with
the attributes in the discovered matches.
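
The two scores can be illustrated with a short Python sketch. It is a hedged reading of the definitions above rather than the authors' code: infoboxes are modeled as sets of attribute names, `matched_pairs` is a hypothetical list of already discovered one-to-one matches, and the sketch assumes that an attribute is never counted as co-occurring with itself.

```python
def grouping_score(a_p, a_q, infoboxes):
    """g(a_p, a_q) = O_pq / min(O_p, O_q) over a list of attribute sets."""
    o_p = sum(1 for box in infoboxes if a_p in box)
    o_q = sum(1 for box in infoboxes if a_q in box)
    o_pq = sum(1 for box in infoboxes if a_p in box and a_q in box)
    smaller = min(o_p, o_q)
    return o_pq / smaller if smaller else 0.0


def inductive_grouping_score(a, a_prime, boxes, boxes_prime, matched_pairs):
    """Average product of grouping scores over matched, co-occurring pairs."""
    # C_a and C_a': matched attributes that co-occur with a and a'.
    c_a = {c for c, _ in matched_pairs
           if c != a and grouping_score(a, c, boxes) > 0}
    c_a_prime = {c for _, c in matched_pairs
                 if c != a_prime and grouping_score(a_prime, c, boxes_prime) > 0}
    if not c_a or not c_a_prime:
        return 0.0
    total = sum(
        grouping_score(a, c, boxes) * grouping_score(a_prime, c_prime, boxes_prime)
        for c, c_prime in matched_pairs
        if c in c_a and c_prime in c_a_prime
    )
    return total / (len(c_a) * len(c_a_prime))
```

Each grouping score is computed within a single language, so `boxes` and `boxes_prime` hold the monolingual infobox schemas of the two languages.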

The final step is to integrate revised matches (lines 16-18).
We take advantage of the certain matches in M to validate the
revised matches U': IntegrateMatches is invoked again, but
this time it considers pairs with similarity lower than $T_{sim}$.
Although we could simply threshold on different values of $T_{sim}$,
as we discuss in Section 4.2, revising uncertain matches as
a separate step improves recall while maintaining high precision
for a wide range of $T_{sim}$ values.

Example 3. Consider the attribute pairs in Figure 2(b), and let
M = {born ~ nascimento, spouse ~ conjuge} be the set of existing
matches. The pairs <other names, outros nomes> and
<born, morte> are uncertain candidates since their value
similarities are lower than the threshold. If the attributes in
these pairs co-occur often with born and spouse, the inductive
grouping scores of <other names, outros nomes> and
<born, morte> are high, and thus, these candidate matches
will be revised and added to U'. Since {born ~ nascimento}
has been identified as a match, morte cannot be integrated
into this match because morte and nascimento are in the
same language and co-occur in infoboxes (their LSI score is
zero). In contrast, neither outros nomes nor other names
appear in M, so this pair is added as a new match. ■

4. EXPERIMENTAL EVALUATION 

Datasets. We collected Wikipedia infoboxes related to 
movies from three languages: English, Portuguese, and Viet- 
namese. Our aim in selecting these languages was to get 
variety in terms of morphology and in the number of in- 
foboxes. Portuguese and English share words with similar 
roots, while Vietnamese is very different from the other two 
languages; and there are significantly fewer infoboxes for 



the pair Vietnamese-English (Vn-En) than for Portuguese- 
English (Pt-En) — this is also reflected in the number of types 
covered by the Vietnamese infoboxes (see below). We selected
Portuguese and Vietnamese infoboxes that belong to
articles which have cross-language links to the equivalent
English article. The dataset for the Pt-En language pair
consists of 8,898 infoboxes, while there are 659 infoboxes for 
the Vn-En pair. Infoboxes that belong to the same entity 
type are grouped together (Section 3). There are 14 such 
groups for Pt-En, and 4 for Vn-En. 

Ground Truth. We created the ground truth for all entity 
types in the dataset. A bilingual expert labeled as correct 
or incorrect all the correspondences containing attributes 
from two distinct languages. A pair of attributes (a, a')
is considered a correct alignment if a and a' have the same
meaning. The ground truth set for the Pt-En pair has 315 
alignments while the Vn-En pair has 160 alignments. 
Evaluation Metrics. To account for the importance of different
attributes and, consequently, of the matches involving
them, we use weighted scores. Intuitively, a match between
frequent attributes will have a higher weight. Let C be the
set of cross-language matches derived by our algorithm; G
be the cross-language matches in the ground truth; $S_T$ the
set of attributes of entity type T in language L; and $S'_T$ be
the attributes in language L' of the corresponding type of T.
Given an attribute $a_i \in S_T$, we denote by $c(a_i)$ and $c_G(a_i)$
the sets of attributes in $S'_T$ that correspond to $a_i$ in C and
G, respectively. Let $A_C$ and $A_G$ be the sets of attributes in $S_T$
that appear in C and G, respectively. The weighted scores
are computed as follows:

$$Precision = \sum_{a_i \in A_C} \frac{|a_i|}{\sum_{a_k \in A_C} |a_k|} \cdot Pr(c(a_i)) \quad (1)$$

$$Recall = \sum_{a_i \in A_G} \frac{|a_i|}{\sum_{a_k \in A_G} |a_k|} \cdot Rc(c(a_i)) \quad (2)$$

$$Pr(c(a_i)) = \sum_{a'_j \in c(a_i)} \frac{|a'_j|}{\sum_{a'_k \in c(a_i)} |a'_k|} \cdot correct(a_i, a'_j) \quad (3)$$

$$Rc(c(a_i)) = \sum_{a'_j \in c_G(a_i)} \frac{|a'_j|}{\sum_{a'_k \in c_G(a_i)} |a'_k|} \cdot correct(a_i, a'_j) \quad (4)$$

where $|a_i|$ represents the frequency of attribute $a_i$ in the
infobox set; $correct(a_i, a'_j)$ returns 1 if the extracted correspondence
<$a_i$, $a'_j$> appears in G and 0 otherwise. Similar
to [15], we compute precision and recall as the weighted averages
over the precision and recall of each attribute $a_i$ (Eq.
1 and 2), and the precision and recall of attribute $a_i$ are
also averaged by the contribution of each attribute $a'_j$ in
$S'_T$ which corresponds to $a_i$ (Eq. 3 and 4). We compute F-measure
as the harmonic mean of precision and recall. The
intuition behind these measures is shown in Example 4.
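Example 4. Consider $S_T = \{a_1, a_2\}$, $S'_T = \{a'_1, a'_2, a'_3\}$,
and associated frequencies (0.6, 0.4) and (0.5, 0.3, 0.2). Suppose
$G = \{\{a_1 \sim a'_1 \sim a'_2\}, \{a_2 \sim a'_3\}\}$, and the alignment
algorithm derives $M = \{\{a_1 \sim a'_1\}, \{a_2 \sim a'_3\}\}$. We
have $c(a_1) = \{a'_1\}$, $c(a_2) = \{a'_3\}$, while $c_G(a_1) = \{a'_1, a'_2\}$,
$c_G(a_2) = \{a'_3\}$. Therefore:

$$Pr(c(a_1)) = \frac{0.5}{0.5} \cdot correct(a_1, a'_1) = 1 \quad \text{and} \quad Pr(c(a_2)) = 1;$$

$$Precision = \frac{0.6}{0.6 + 0.4} \cdot Pr(c(a_1)) + \frac{0.4}{0.4 + 0.6} \cdot Pr(c(a_2)) = 1;$$

$$Rc(c(a_1)) = \frac{0.5}{0.5 + 0.3} \cdot correct(a_1, a'_1) + \frac{0.3}{0.5 + 0.3} \cdot correct(a_1, a'_2) = \frac{0.5}{0.8} \cdot 1 + \frac{0.3}{0.8} \cdot 0 = 0.625 \quad \text{and} \quad Rc(c(a_2)) = 1;$$

$$Recall = \frac{0.6}{0.6 + 0.4} \cdot Rc(c(a_1)) + \frac{0.4}{0.4 + 0.6} \cdot Rc(c(a_2)) = 0.775. \; ■$$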
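
The weighted measures can be checked with a small Python sketch that reproduces the numbers of Example 4. The function name and the dictionary-based representation (each source attribute mapped to the set of target attributes it corresponds to) are assumptions chosen for illustration, not part of WikiMatch.

```python
def weighted_scores(derived, truth, freq, freq_prime):
    """Weighted precision, recall, and F-measure in the spirit of Eq. 1-4.

    derived, truth: dict mapping a source attribute to a set of target attributes.
    freq, freq_prime: attribute frequencies in the source and target language.
    """
    def correct(a_i, a_j):
        # 1 only if the extracted correspondence also appears in the ground truth.
        return a_j in derived.get(a_i, ()) and a_j in truth.get(a_i, ())

    def per_attribute(a_i, targets):
        total = sum(freq_prime[t] for t in targets)
        hits = sum(freq_prime[t] for t in targets if correct(a_i, t))
        return hits / total if total else 0.0

    def weighted_average(mapping):
        attrs = [a for a, targets in mapping.items() if targets]
        total = sum(freq[a] for a in attrs)
        if not total:
            return 0.0
        return sum(freq[a] / total * per_attribute(a, mapping[a]) for a in attrs)

    precision = weighted_average(derived)  # Eq. 1 with Eq. 3
    recall = weighted_average(truth)       # Eq. 2 with Eq. 4
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Data of Example 4.
freq = {"a1": 0.6, "a2": 0.4}
freq_prime = {"a1'": 0.5, "a2'": 0.3, "a3'": 0.2}
truth = {"a1": {"a1'", "a2'"}, "a2": {"a3'"}}
derived = {"a1": {"a1'"}, "a2": {"a3'"}}
precision, recall, f_measure = weighted_scores(derived, truth, freq, freq_prime)
```

With these inputs the sketch yields a precision of 1 and a recall of 0.775, matching the example.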

Finding Matches with WikiMatch. For each entity type
in the two language pairs, we ran WikiMatch and derived a
set of matches. Table 1 shows examples of such matches.
Note that we are able to find alignments where an attribute
in one language is mapped to two (or more) attributes in
the other language. For this experimental evaluation, we
configured WikiMatch as follows: the threshold $T_{sim}$ used
for both vsim and lsim was set to 0.6; the LSI threshold
($T_{LSI}$) was set to 0.1. The same values were used for all
languages and entity types without any special tuning.

Table 1: Some alignments identified by WikiMatch

Type    Portuguese-English              Vietnamese-English
------  ------------------------------  --------------------------
Movie   direcao ~ directed by           dao dien ~ directed by
        idioma original ~ language      ngon ngu ~ language
        elenco original ~ starring      dien vien ~ starring
        roteiro ~ written by            kich ban ~ written by
        lancamento ~ release date       kich ban ~ story by

Actor   nascimento ~ born               noi sinh ~ born
        data de nascimento ~ born       vai tro ~ occupation
        falecimento ~ died              cong viec ~ occupation
        morte ~ died                    chong ~ spouse
        outros nomes ~ other names      ten khac ~ other names


4.1 Comparison against Existing Approaches 

We compared WikiMatch to techniques for schema matching
and cross-language information retrieval, and to a system
designed to align and complete Wikipedia templates across
languages. They are described below.

- LSI. We use LSI [7] as a technique for cross-language
attribute alignment. LSI similarity scores were computed for
all attribute pairs {$a_p$, $a_q$} in an entity type T, where $a_p \in L$
and $a_q \in L'$. The top 1, 3, 5, and 10 scoring correspondences
for each $a_p$ were used to identify matches. The best F-measure
value was obtained by the top-1 configuration.
- Bouma. This approach for aligning infobox attributes across
languages uses attribute values and cross-language links [5]
(see Section 6). The input to Bouma was the same as that provided
to WikiMatch, i.e., attributes grouped by their entity types.
- COMA++. This schema matching framework supports
both name- and instance-based matchers. We ran COMA++
with three configurations: name matching; instance matching;
and a combination of both. To emulate approaches used
in cross-language ontology alignment [10, 12], we tested a
variation of COMA++ where Google Translator [14] and
our automatically generated dictionary (Section 3.2) were
used to translate attribute labels and values, respectively.
The best configuration for Pt-En uses translation for both
attribute names and values. For Vn-En, translating only the
values provided the best results (see footnote 1).
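
The top-k selection used for the LSI baseline above can be sketched as follows. The function name and the dictionary of precomputed scores are hypothetical; computing the LSI scores themselves is outside the scope of this sketch.

```python
import heapq
from collections import defaultdict


def top_k_matches(lsi_scores, k):
    """Keep, for each source attribute, its k best-scoring target attributes.

    lsi_scores: dict mapping (a_p, a_q) to a precomputed LSI similarity score.
    """
    by_source = defaultdict(list)
    for (a_p, a_q), score in lsi_scores.items():
        by_source[a_p].append((score, a_q))
    return {
        (a_p, a_q)
        for a_p, candidates in by_source.items()
        for _, a_q in heapq.nlargest(k, candidates)
    }
```

Calling `top_k_matches(scores, 1)` corresponds to the top-1 configuration, which the text reports as the best-performing one for this baseline.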

Effectiveness of WikiMatch. Table 2 shows the results 
of the evaluation measures for the alignments derived by 
the different approaches applied to all entity types in our 
datasets. Here, we show only the results for the configura- 
tions that led to the highest F-measure (see Appendix C for 
the results of other configurations). In Table 2, the last row 
for each language pair shows the average across all types. 
The highest scores for each type/metric are shown in bold. 

WikiMatch obtained the highest F-measure values for al- 
most all types and language pairs. Its recall is lower than 
Bouma's for film in Pt-En, because it missed correct matches 
involving rare attributes, which occur in less than 0.5% of 
the infoboxes. In terms of precision, Bouma and COMA++ 
outperformed WikiMatch for some types. Still, considering 

1 We also experimented with different similarity thresholds 
and selected the values that led to the best F-measure score. 



Table 2: Weighted Precision (P), Recall (R), and
F-measure (F) for the different approaches.

Portuguese-English

Type            WikiMatch         Bouma             COMA++            LSI
                P    R    F       P    R    F       P    R    F       P    R    F
film            [illegible]       [illegible]       [illegible]       [illegible]
show            [illegible]       [illegible]       [illegible]       [illegible]
actor           [illegible]       [illegible]       [illegible]       [illegible]
artist          [illegible]       [illegible]       [illegible]       [illegible]
channel         [illegible]       [illegible]       [illegible]       [illegible]
company         0.86 0.87 0.87    1.00 0.53 0.69    0.95 0.70 0.81    0.67 0.74 0.71
comics ch.      0.97 0.87 0.92    0.99 0.65 0.79    0.99 0.77 0.86    0.37 0.53 0.43
album           1.00 0.93 0.96    1.00 0.69 0.82    1.00 0.77 0.87    0.56 0.48 0.52
adult actor     0.84 0.59 0.69    1.00 0.26 0.41    0.73 0.43 0.54    0.22 0.19 0.20
book            0.80 0.75 0.77    0.75 0.58 0.66    0.75 0.66 0.70    0.15 0.36 0.21
episode         0.81 0.90 0.85    0.86 0.32 0.47    1.00 0.38 0.55    0.09 0.17 0.12
writer          1.00 0.49 0.65    1.00 0.22 0.36    1.00 0.27 0.43    0.60 0.49 0.54
comics          0.92 0.65 0.76    1.00 0.13 0.23    0.91 0.45 0.61    0.00 0.00 0.00
fictional ch.   1.00 0.69 0.82    1.00 0.06 0.11    0.81 0.81 0.81    0.36 0.37 0.36
Avg             0.93 0.75 0.82    0.94 0.45 0.55    0.91 0.58 0.69    0.30 0.34 0.31

Vietnamese-English

Type            WikiMatch         Bouma             COMA++            LSI
                P    R    F       P    R    F       P    R    F       P    R    F
film            1.00 0.99 0.99    1.00 0.99 0.99    1.00 0.91 0.95    0.65 0.62 0.63
show            1.00 0.88 0.93    1.00 0.36 0.53    1.00 0.61 0.76    0.57 0.49 0.53
actor           1.00 0.49 0.66    1.00 0.28 0.44    1.00 0.39 0.56    0.49 0.35 0.41
artist          1.00 0.65 0.79    1.00 0.32 0.48    1.00 0.25 0.40    0.72 0.50 0.59
Avg             1.00 0.75 0.84    1.00 0.49 0.61    1.00 0.54 0.67    0.61 0.49 0.54


the results averaged across all entity types, we tie in preci- 
sion for Vn-En and come very close for Pt-En. By appro- 
priately setting the thresholds, our approach can be tuned 
to obtain higher precision. However, since one of our goals 
is to improve recall for multilingual queries (see Section 5), 
where having more matches leads to the retrieval of more relevant
answers, we aim to obtain a balance between recall and
precision. 

WikiMatch outperforms the multilingual COMA++ con- 
figurations. This indicates that the combination of machine 
translation and string similarity is not effective for determin- 
ing multilingual matches. This observation is also supported 
by the low F-measure scores for the name-based matching 
configuration (see Appendix C). 

Overall, LSI produced the worst results. This is due to 
the fact that it only uses co-occurrences as a source of simi- 
larity; it does not leverage other sources of similarity which 
are important to distinguish between correct and incorrect 
correspondences. In addition, while LSI performs well given 
parallel input, in our scenario, its effectiveness is reduced 
due to the heterogeneity among infoboxes in different lan- 
guages (see Appendix A). 

Effect of Cross-Language Heterogeneity. Comparing 
results across languages, we see that Vn-En alignments were 
more accurate than the Pt-En in some cases, despite the fact 
that English is morphologically more similar to Portuguese. 
The reason for this behavior is that the dual-language in- 
foboxes for Pt-En are more heterogeneous than the ones 
for Vn-En. Using our gold data, we calculated the overlap 
between attributes for pairs of corresponding infoboxes in 
languages L and L' (Appendix A) . The result of this analy- 
sis showed that the overlap is significantly higher for Vn-En. 
For example, for the entity type film the overlap is 87% for 
Vn-En and only 36% for Pt-En. As a result, nearly all meth- 
ods did better for this type for Vn-En. We also computed 
the correlations for overlap and the results for the different 
approaches. For all approaches, the coefficients show posi- 
tive correlations among overlap and the results, indicating 
the results tend to be better for types that are more homogeneous
across languages. Still, WikiMatch outperforms
other approaches for entity types with both high (e.g., film 
in Vn-En) and low overlap (e.g., channel). 
Limitations. We should note that not all correct attribute 
pairs co-occur in the data — some will not be found in any 
dual-language infobox. For example, no dual-language (Pt- 
En) infobox contains the attributes premios and awards 
even though they are synonyms. Like other approaches, 
WikiMatch is not able to identify such matches since all sim- 
ilarity measures return low scores. However, these are rare 
matches, which as we see from the results, do not signif- 
icantly compromise recall. Another limitation of our ap- 
proach is that, currently, it does not support languages that 
do not use alphabetical characters. 

4.2 Contribution of Different Components 

We analyzed how much each component of WikiMatch 
contributes to the results by running it multiple times, and 
each time removing one of the components. The results, av- 
eraged over all types, are summarized in Table 3. WikiMatch 
leads to the highest F-measure values, showing that the com- 
bination of its different components is beneficial. 
WikiMatch-ReviseUncertain. When ReviseUncertain
is omitted, recall drops substantially while there is little or
no change to precision. This underscores the importance of
this step: ReviseUncertain leads to F-measure gains between
14% and 20% for the two language pairs. We note
that the effectiveness of ReviseUncertain varies across the
different types: types whose correspondences have low value
similarity tend to benefit more from ReviseUncertain.
WikiMatch-IntegrateMatches. This configuration generates
matches without the IntegrateMatches step, which
checks the pairwise correlation constraints for the attributes
in a match. As Table 3 shows, removing this step leads
to a drop in precision for both Pt-En and Vn-En. This happens
because it finds some incorrect matches that have high
lsim or vsim values, which in WikiMatch are filtered out by
IntegrateMatches.

WikiMatch random. To assess the contribution of or- 
dering candidate pairs by their LSI scores, we compared it 
to a random ordering, while maintaining both value and 
link similarity constraints to validate match candidates. As 
the results show, the random ordering leads to significantly 
lower values for both precision and recall. This indicates the 
LSI ordering is effective at reducing error propagation. 
WikiMatch single step. In WikiMatch single step, we
omit the invocation of IntegrateMatches (line 17 in Algorithm 1)
and consider as correspondences all candidates
whose lsim or vsim values are positive. The sharp decline
in F-measure provides evidence that considering certain and
uncertain matches separately is crucial.
Similarity Features. We have also studied the contribution
of different similarity sources. We report the results
of three variations of WikiMatch where each omits
the use of one feature: WikiMatch-vsim, WikiMatch-lsim,
and WikiMatch-LSI. For WikiMatch-LSI, the candidate pairs
were sorted in decreasing order of max(vsim, lsim), and
validated by the constraints on just these features. The
numbers indicate that value similarity is the most important
feature. Without vsim, F-measure drops about 29%
in Portuguese and 19% in Vietnamese. Link similarity has
a bigger impact for Vietnamese than for Portuguese. As expected,
this feature is likely to be more important for language pairs
with more diverse morphologies. For example, link similarity
contributes 13% in precision for Vietnamese, while for
Portuguese the contribution is 1%. Without LSI, F-measure
drops 12% in Portuguese and 7% in Vietnamese.

Figure 3 shows how WikiMatch (WM) and WikiMatch 
without ReviseUncertain (WM*) behave when each of the 
features is removed. In all cases, the recall of WM is higher. 
This confirms the importance of ReviseUncertain, which 
is able to identify additional correct matches even when 
WikiMatch is given less evidence. 

Table 3: Contribution of different components

Configuration                  Portuguese-English    Vietnamese-English
                               P     R     F         P     R     F
WikiMatch                      0.93  0.75  0.82      1.00  0.75  0.84
WikiMatch-ReviseUncertain      0.94  0.54  0.66      1.00  0.59  0.72
WikiMatch-IntegrateMatches     0.84  0.70  0.75      0.95  0.74  0.82
WikiMatch random               0.74  0.40  0.50      0.77  0.56  0.64
WikiMatch single step          0.39  0.89  0.52      0.56  0.88  0.64
WikiMatch-vsim                 0.90  0.43  0.58      1.00  0.51  0.68
WikiMatch-lsim                 0.92  0.74  0.82      0.87  0.70  0.78
WikiMatch-LSI                  0.83  0.64  0.72      0.89  0.69  0.78



[Figure: recall (y-axis, 0.0 to 1.0) of WM* and WM under the configurations no vsim, no lsim, and no LSI; one panel for Pt-En and one for Vn-En]

Figure 3: Impact of ReviseUncertain

5. CASE STUDY: EVALUATING 
CROSS-LANGUAGE QUERIES 

The usual approach to answering cross-language queries is 
to translate the user query into the language of the articles, 
and then proceed with monolingual query processing. Our 
attribute correspondences can help retrieval systems in this 
translation process. 

To show the benefits of identifying the multilingual at- 
tribute correspondences, below, we present a case study 
using WikiQuery [25], a system that supports structured 
queries over infoboxes. WikiQuery supports c-queries, which 
consist of a set of constraints on entity types, attribute 
names and values. For example, for the query: What are the 
Web sites of Brazilian actors who starred in films awarded 
with an Oscar?, the corresponding c-query is expressed as: 
Q: Actor(born=Brazil, website=?) and Film(award=Oscar),
where Actor and Film are entity types; born, website, and
award are attribute names.

The matches identified by WikiMatch for a given language 
pair are stored in a dictionary. To provide multilingual an- 
swers to a query, WikiQuery looks up the dictionary and 
retrieves, for each term in the source language, its transla- 
tions into the target language. If a translation cannot be 
found for a given attribute a, the query is relaxed by remov- 
ing the constraint on a. 
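
The dictionary lookup and the relaxation rule can be sketched as follows. This is an illustrative reading of the description above, not WikiQuery's actual interface: the query representation, the function name, and the choice to leave attribute values untranslated are assumptions made for the example.

```python
def translate_query(query, dictionary):
    """Translate a c-query using a dictionary of discovered matches.

    query: list of (entity_type, [(attribute, operator, value), ...]).
    dictionary: dict mapping a source-language term to a list of translations.
    """
    translated = []
    for entity_type, constraints in query:
        target_types = dictionary.get(entity_type)
        if not target_types:
            continue  # assumption: a type without a translation is dropped
        kept = []
        for attribute, operator, value in constraints:
            targets = dictionary.get(attribute)
            if targets:
                # Alternative names are joined with "|" as in the c-query syntax.
                kept.append(("|".join(targets), operator, value))
            # Otherwise the constraint on this attribute is removed (relaxation).
        translated.append((target_types[0], kept))
    return translated


dictionary = {"ator": ["actor"], "nascimento": ["born"]}
query = [("ator", [("nascimento", "=", "Brasil"), ("site", "=", "?")])]
relaxed = translate_query(query, dictionary)
```

In this toy run the hypothetical attribute `site` has no translation, so its constraint is dropped and only the constraint on `born` survives.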

The Experiment. We ran a set of ten c-queries (Table 4) 
in Portuguese and Vietnamese on the respective language 
datasets. We then translated the queries into English (as 
described above) and ran them over the English dataset. 
For each query, the top 20 answers were presented to two 
Figure 4: Cumulative Gain of k answers

evaluators who were required to give each answer a score on 
a five-point relevance scale. The results were evaluated in 
terms of cumulative gain (CG) [16], which has been widely 
used in information retrieval. CG is the total relevance score 
of all answers returned by the system for a given query and 
it allows us to examine the usefulness, or gain, of a result set.
Figure 4 shows the CG for Portuguese queries run over the 
Portuguese infoboxes (Pt) and for Vietnamese queries run 
over the Vietnamese infoboxes (Vn); and the CG for these 
queries translated into English run against the English infoboxes
(Pt→En and Vn→En). We can see that CG is
always larger for the queries translated into English. This 
shows that our attribute correspondences help the transla- 
tion and lead to the retrieval of more relevant answers. Be- 
cause the English dataset covers a considerable portion of 
the contents both in Portuguese and Vietnamese infoboxes, 
it often returns many more answers. 
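
As a brief illustration of the measure, cumulative gain at each rank k is the running sum of the relevance scores of the top k answers. The sketch below assumes one relevance score per answer, already ordered by rank; how the scores of the two evaluators are combined is not specified here and is left out.

```python
from itertools import accumulate


def cumulative_gain(relevance_scores):
    """Return CG at k = 1..n for a ranked list of relevance scores."""
    return list(accumulate(relevance_scores))
```

For hypothetical scores `[4, 3, 0, 2]`, the function returns `[4, 7, 7, 9]`, i.e., the gain after the first, second, third, and fourth answer.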

Even though the CG is larger when the queries are translated
into English, the gain for Vn→En queries is smaller
than the one obtained for Pt→En. This is due, in part, to
an artifact of our translation procedure. The Vietnamese 
dataset is very small, and many of the English types and 
attribute names do not have any correspondences in Viet- 
namese. As a result, the queries in our workload that include 
these dangling types and attribute names cannot be trans- 
lated and are relaxed by WikiQuery. Although answers are 
returned for the relaxed queries, few (and sometimes none) 
of them are relevant. Since the Portuguese dataset is larger 
than the Vietnamese dataset, this problem is attenuated. 

6. RELATED WORK 

Cross-language matching has received a lot of attention
in the information retrieval and natural language processing 
communities (see e.g., [9, 21]). While their focus has been 
on documents represented in plain text, our work deals with 
structured information. More closely related to our work are 
recent approaches to ontology matching, schema matching, 
and infobox alignment, which we discuss below. 
Cross-Language Ontology Alignment. Fu et al. [12] 
and Santos et al. [10] proposed approaches that translate 
the labels of a source ontology using machine translation, 
and then apply monolingual ontology matching algorithms. 
The Ontology Alignment Evaluation Initiative (OAEI) [28] 
had a task called very large crosslingual resources (VLCR). 
VLCR consisted of matching three large ontologies includ- 
ing DBpedia, WordNet, and the Dutch audiovisual archive 
and made use of external resources such as hypernym relationships
from WordNet and EuroWordNet, a multilingual
database of WordNet for several European languages.
Although related, there are important differences between 



Table 4: List of c-queries used in the Case Study

1. Movies with an actor who is also a politician
   Pt: filme(nome=?) and ator(ocupacao="politico")
   Vn: phim(ten=?) and dien vien(cong viec="chinh khach")

2. Actors who worked with director Francis Ford Coppola in a movie
   Pt: filme(nome=?) and ator(nome=?) and diretor(nome="francis ford coppola")
   Vn: phim(ten=?) and dien vien(ten=?) and dao dien(ten="francis ford coppola")

3. Movies that won Best Picture Award and were directed by a director from England
   Pt: filme(direcao=?) and premio(melhor filme=?) and
       diretor(nascimento|pais de nascimento|pais|data de nascimento="Inglaterra")
   Vn: phim(dao dien=?) and giai thuong(phim xuat sac nhat=?) and
       dao dien(sinh|noi sinh="anh")

4. Movies directed by director younger than 40 (born after 1970) and that have gross
   revenue greater than 10 million
   Pt: filme(receita > 10000000) and diretor(nascimento|data de nascimento >= 1970)
   Vn: phim(doanh thu|thu nhap > 10000000) and dao dien(sinh|ngay sinh >= 1970)

5. Books that were written by a writer born before 1975
   Pt: livro(nome=?) and escritor(nascimento < 1975)
   Vn: sach(ten=?) and nha van(ngay sinh < 1975)

6. Names of French Jazz artists
   Pt: artista(nome=?, nascimento|pais de nascimento|pais|data de nascimento="Franca",
       genero="Jazz")
   Vn: nghe si(ten=?, sinh|noi sinh="Phap", the loai="Jazz")

7. Characters created by Eric Kripke
   Pt: personagem(nome=?, criado por="Eric Kripke")
   Vn: nhan vat(ten=?, sang tac="Eric Kripke")

8. Names of the albums from the genre "rock" recorded before 1980
   Pt: album(nome=?, genero="Rock", gravado em < 1980)
   Vn: album(ten=?, the loai="Rock", ghi am|thu am < 1980)

9. Names of artists from the genre "progressive rock" who have been born after 1950
   Pt: artista(nome=?, genero="Rock Progressivo", nascimento|data de nascimento > 1950)
   Vn: nghe si(ten=?, the loai="Progressive Rock", sinh|nam sinh > 1950)

10. Headquarters of companies with revenue greater than 10 billion
    Pt: companhia(sede=?, faturamento > 10 bilhoes)
    Vn: cong ty(tru so|tru so chinh=?, doanh thu|thu nhap > 10 billion)



these approaches and ours. While ontologies have a well- 
defined and clean schema, Wikipedia infoboxes are hetero- 
geneous and loosely defined. In addition, these works con- 
sider ontologies in isolation and do not take into account 
values associated with the attributes. As we have discussed 
in Section 4, values are an important component to accu- 
rately determine matches. Last, but not least, in contrast 
to VLCR, our approach does not rely on external resources. 
Schema Matching. The problem of matching multilingual 
schemas has been largely overlooked in the literature. The 
only work on this topic aimed to identify attribute corre- 
spondences between English and Chinese schemas [37], re- 
lying on the fact that the names of attributes in Chinese 
schemas are usually the initials of their names in PinYin 
(i.e., romanization of Chinese characters). This solution not 
only required substantial human intervention and a manu- 
ally constructed domain ontology, but it only works for Chi- 
nese and English. Although it is possible to combine tra- 
ditional schema matching approaches [31] with automatic 
translation (similar to [12, 10]), as shown in Section 4, this 
is not effective for matching multilingual infoboxes. 

Also related to our approach are techniques for uncertain 
schema matching and data integration. Gal et al. [4] de- 
fined a class of monotonic schema matchers for which higher 
similarity scores are an indication of more precise mappings. 
Based on this assumption, they suggest frameworks for com- 
bining results from the same or different matchers. However, 
due to the heterogeneity across infoboxes, this assumption 
does not hold in our scenario: matches with high similarity 
scores are not necessarily accurate. To test this hypothesis, we
experimented with different similarity thresholds for
COMA++, and for higher thresholds, we observed a
drop in both precision and recall.

Cross-Language Infobox Alignment. Adar et al. [1]
proposed Ziggurat, a system that uses a self-supervised classifier
to identify cross-language infobox alignments. The
classifier uses 26 features, including equality between attributes
and values and n-gram similarity. To train the classifier,
Adar et al. applied heuristics to select 20K positive
and 40K negative alignment examples. Through a 10-fold
cross-validation experiment with English, German, French,
and Spanish, they report having achieved 90.7% accuracy.
Bouma et al. [5] designed an alignment strategy for English
and Dutch which relies on matching attribute-value pairs:
values $v_e$ and $v_d$ are considered matches if they are identical
or if there is a cross-language link between the articles corresponding
to $v_e$ and $v_d$. A manual evaluation of 117 alignments
found only two errors. Although there has not been a di- 
rect comparison between these two approaches, Bouma et al. 
state that their approach would lead to a lower recall. But 
the superior results obtained by Ziggurat rely on the avail- 
ability of a large training set, which limits its scalability and 
applicability: training is required for each different domain 
and language pair considered; and the approach is likely to 
be effective only for domains and languages that have a large 
set of representatives. Adar et al. acknowledge that because 
their approach heavily relies on syntactic similarity (it uses 
n-grams), it is limited to languages that have similar roots. 
In contrast, WikiMatch is automated, requiring no training,
and it can be used to create alignments for languages
that are not syntactically similar, such as Vietnamese
and English. Nonetheless, we would have liked to
compare Ziggurat against our approach, in particular, for 
the Pt-En language pair. Unfortunately, we were not able 
to obtain the code or the datasets described in [1]. 

7. CONCLUSION 

In this paper, we proposed WikiMatch, a new approach for
aligning Wikipedia infobox schemas in different languages
which requires no training and is effective for languages with
different morphologies. Furthermore, it does not require external
sources such as dictionaries or machine translation
systems. WikiMatch explores different sources of similarity
and combines them in a systematic manner. By prioritizing
high-confidence correspondences, it is able to minimize
error propagation and achieve a good balance between recall
and precision. Our experimental analysis showed that
WikiMatch outperforms state-of-the-art approaches for cross-language
information retrieval, schema matching, and multilingual
attribute alignment; and that it is effective for types
that have high cross-language heterogeneity and few data instances.
We also presented a case study that demonstrates
the benefits of the correspondences discovered by our approach
in answering multilingual queries over Wikipedia: by
using the derived correspondences, we can translate queries
posed in under-represented languages into English, and as a
result, return a larger number of relevant answers.

There are a number of problems that we intend to pursue
in future work. To further improve the effectiveness
of WikiMatch, we would like to investigate the use of a
fixed-point-based matching strategy, such as similarity
flooding [23]. Because our approach is automated, the results
it produces can be uncertain or incorrect. To properly deal
with this issue during the evaluation of multilingual queries,
we plan to explore approaches that take uncertainty into
account [8]. While in this paper we focused on infoboxes, we
would like to investigate the effectiveness of WikiMatch on
other sources of structured data present in Wikipedia.
APPENDIX

A. STRUCTURAL HETEROGENEITY 

Using the alignments in our ground truth sets (Section 4),
we analyzed the structural heterogeneity of the infoboxes by
considering the overlap among attribute sets from infoboxes
in a given language pair. For each infobox $I_L$ in language $L$
which has a cross-language link $cl$ to its equivalent infobox
$I'_{L'}$ in language $L'$, we computed the overlap between their
schemas $S_i$ and $S_j$ as the size of the intersection between
the attributes in $S_i$ and $S_j$ over the size of their union. To be
considered part of the intersection, an attribute pair must
appear in the ground truth.
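
As an illustration, the overlap computation described above can be sketched in Python as follows. This is a minimal sketch rather than the code used in our analysis: the function name `schema_overlap`, the example attribute names, and the assumption that correspondences are one-to-one are all hypothetical.

```python
def schema_overlap(schema_a, schema_b, ground_truth):
    """Overlap between the schemas of two cross-language-linked infoboxes.

    schema_a, schema_b: sets of attribute names, one set per language.
    ground_truth: set of (attr_a, attr_b) pairs that are known to correspond.
    Assumes one-to-one correspondences (an assumption of this sketch).
    """
    # An attribute pair is in the intersection only if the ground truth aligns it.
    matched = {(a, b) for (a, b) in ground_truth
               if a in schema_a and b in schema_b}
    # A matched pair is one shared attribute, so it is counted once in the union.
    union_size = len(schema_a) + len(schema_b) - len(matched)
    return len(matched) / union_size if union_size else 0.0


# Hypothetical Portuguese and English film infoboxes.
schema_pt = {"nome", "direção", "elenco", "duração"}
schema_en = {"name", "directed by", "starring", "budget"}
truth = {("nome", "name"), ("direção", "directed by"), ("elenco", "starring")}

print(schema_overlap(schema_pt, schema_en, truth))  # 3 matched pairs over a union of 5
```

Counting each aligned pair once in the union keeps the measure between 0 and 1, with 1 reached only when every attribute on both sides has a counterpart in the ground truth.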

The results for each entity type and language pair are 
shown in Table 5. The English-Vietnamese pair is more
homogeneous than the English-Portuguese pair. Considering
only the entity types appearing in both language pairs (i.e., 
film, show, actor, and artist) the average overlap is 44% for 
Portuguese-English and 69% for Vietnamese-English. As 
seen in our experimental results (Table 2), all approaches 
we considered have better results when the overlap is larger. 



Table 5: Overlap in infoboxes

Pt-En   36%  45%  42%  52%  31%  59%  52%  63%  47%  32%
Vn-En   87%  75%  46%  67%


B. ADDITIONAL RESULTS 

Macro-averaging. The weighting employed in the evaluation
metrics used in Section 4 can be considered as
micro-averaging. We also computed macro-averaging by
discarding the weights and just counting distinct attribute-name
pairs. The results in Table 6 show that WikiMatch still
outperforms the other approaches.
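
The difference between the two ways of counting can be sketched as follows. This is only an illustration: the function `precision_recall_f`, the attribute pairs, and the weights (here imagined as how often each pair occurs in the data) are hypothetical and are not the exact weighting scheme of Section 4.

```python
def precision_recall_f(found, truth, weight=None):
    """Precision, recall, and F-measure over sets of attribute-name pairs.

    weight: optional dict mapping a pair to a non-negative weight
    (hypothetical here). With weight=None every distinct pair counts once,
    which corresponds to the unweighted variant described above.
    """
    def w(pair):
        return 1.0 if weight is None else weight.get(pair, 1.0)

    correct = sum(w(p) for p in found & truth)
    total_found = sum(w(p) for p in found)
    total_truth = sum(w(p) for p in truth)
    precision = correct / total_found if total_found else 0.0
    recall = correct / total_truth if total_truth else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical ground truth, matcher output, and pair weights.
truth = {("elenco", "starring"), ("direção", "directed by"), ("orçamento", "budget")}
found = {("elenco", "starring"), ("género", "country")}
weights = {("elenco", "starring"): 8, ("direção", "directed by"): 1,
           ("orçamento", "budget"): 1, ("género", "country"): 1}

unweighted = precision_recall_f(found, truth)
weighted = precision_recall_f(found, truth, weights)
print(unweighted, weighted)
```

In this toy example the single correct match is a frequent pair, so the weighted scores are much higher than the unweighted ones; discarding the weights removes that advantage.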



Table 6: Macro-averaging results

        WikiMatch           Bouma               COMA++              LSI
        P     R     F       P     R     F       P     R     F       P     R     F
PT-EN   0.88  0.60  0.71    0.93  0.36  0.52    0.79  0.47  0.59    0.27  0.28  0.27
VN-EN   1.00  0.58  0.73    1.00  0.34  0.51    0.93  0.45  0.60    0.60  0.43  0.50



Threshold Sensitivity. We have studied the sensitivity
of WikiMatch to variations in the thresholds used in our
algorithms. Figure 5 shows the variation of the weighted
F-measure as the thresholds $T_{sim}$ and $T_{lsi}$ increase. The
lines show that WikiMatch is stable over a broad range of
threshold values. As a general guideline, $T_{lsi}$ should be set
low since the main purpose of LSI is to sort the candidate
matches, while $T_{sim}$ should be set high as it determines the
selection of the high-confidence matches. We observe a similar
behavior for both language pairs: although the highest
F-measure is achieved around $T_{sim} = 0.6$, the values obtained
for all thresholds are comparable. The LSI score is used to
sort the priority queue containing the candidate pairs. However,
only attribute pairs that surpass $T_{lsi}$ are inserted into
this queue. Again, the curves for $T_{lsi}$ are similar for both
language pairs. F-measure changes very little for $T_{lsi}$
values up to 0.6. High values of $T_{lsi}$ reduce recall
and, as a consequence, F-measure also decreases.
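
The roles of the two thresholds can be illustrated with a short sketch. It is a simplification and does not reproduce Algorithm 1: the function names, the example scores, and the rule that a queued pair is accepted when a single similarity value exceeds the higher threshold are assumptions made for illustration.

```python
import heapq


def build_candidate_queue(lsi_scores, t_lsi):
    """Return the pairs whose LSI score surpasses t_lsi, best score first."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    heap = [(-score, pair) for pair, score in lsi_scores.items() if score > t_lsi]
    heapq.heapify(heap)
    ordered = []
    while heap:
        neg_score, pair = heapq.heappop(heap)
        ordered.append((pair, -neg_score))
    return ordered


def select_high_confidence(queue, similarity, t_sim):
    """Keep queued pairs whose (hypothetical) similarity exceeds t_sim."""
    return [pair for pair, _ in queue if similarity.get(pair, 0.0) > t_sim]


# Hypothetical scores for three candidate attribute pairs.
lsi_scores = {("elenco", "starring"): 0.9,
              ("direção", "directed by"): 0.7,
              ("país", "budget"): 0.05}
similarity = {("elenco", "starring"): 0.8,
              ("direção", "directed by"): 0.4}

queue = build_candidate_queue(lsi_scores, t_lsi=0.1)
accepted = select_high_confidence(queue, similarity, t_sim=0.6)
print(queue)
print(accepted)
```

With a low first threshold, only clearly unrelated pairs are kept out of the queue, which protects recall; the higher second threshold then restricts the accepted matches to the most reliable ones.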
Figure 5: Impact of different thresholds

Alternatives for Attribute Correlation. Besides LSI, 
we have explored other measures for cross-language attribute
correlation. Three possibilities were considered to capture
the correlation between attributes $a_p$ and $a_q$:

$$X_1 = O_{pq}; \quad X_2 = \left(1 + \frac{O_{pq}}{O_p}\right)\left(1 + \frac{O_{pq}}{O_q}\right); \quad \text{and} \quad X_3 = \frac{O_{pq} \cdot O_p}{O_p + O_q}$$

Here, $O_p$ and $O_q$ are the number of occurrences of attributes $a_p$
and $a_q$, respectively, and $O_{pq}$ is the number of times they
co-occur in the set of dual-language infoboxes of entity type
$T$. Recall that the correlation score is used to order the
candidate matches (Algorithm 1). Therefore, the best cor- 
relation measure for our approach is the one that leads to an 
ordering where the correct matches appear before the incor- 
rect ones. We analyzed the ranking of matches produced by 
each of these measures in terms of mean average precision 
(MAP) [22], which is the standard evaluation measure for 
ranked items in information retrieval. It is calculated as: 

$$\mathrm{MAP} = \frac{1}{|A|} \sum_{j=1}^{|A|} \frac{1}{m_j} \sum_{k=1}^{m_j} P(R_{jk})$$

where $|A|$ is the number of attributes in language $L$, $R_{jk}$ is
the set of ranked pairs from the top result until attribute $a_k$,
$P$ is the precision, and $m_j$ is the number of correct matches
for attribute $j$. A perfect ordering (MAP = 1) would place
all correct matches before the first incorrect match.
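
For concreteness, the MAP computation can be sketched as follows. This follows the standard textbook definition of mean average precision; the function names, the example rankings, and the choice to score attributes without any correct match as 0 are assumptions of this sketch, not details taken from our evaluation code.

```python
def average_precision(ranked, correct):
    """Average precision of one ranked candidate list.

    ranked: candidate matches for one attribute, best first.
    correct: set of correct matches for that attribute.
    """
    if not correct:
        return 0.0  # assumption: attributes with no correct match score 0
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in correct:
            hits += 1
            # Precision of the ranking cut at this correct match.
            precision_sum += hits / rank
    return precision_sum / len(correct)


def mean_average_precision(rankings, ground_truth):
    """rankings: attribute -> ranked candidates; ground_truth: attribute -> set."""
    if not rankings:
        return 0.0
    total = sum(average_precision(ranked, ground_truth.get(attr, set()))
                for attr, ranked in rankings.items())
    return total / len(rankings)


# Hypothetical rankings for two Portuguese attributes.
rankings = {"elenco": ["starring", "director", "budget"],
            "direção": ["writer", "directed by"]}
ground_truth = {"elenco": {"starring"}, "direção": {"directed by"}}

print(mean_average_precision(rankings, ground_truth))
```

In this example the first attribute has its correct match at rank 1 and the second at rank 2, so the two average precisions are 1 and 0.5.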

MAP values for LSI and the variations of the correlation 
score are shown in Table 7. To serve as baseline, we tried 
randomly ordering the attribute pairs. The results show 
that LSI provides the best ordering. Note, however, that all 
variations of X are superior to random ordering. The supe- 
riority of LSI can be attributed to two factors: the dimen- 
sionality reduction brought by SVD which groups together 
similar infoboxes, and the fact that in addition to the co- 
occurrence frequency in dual-language infoboxes (which is 
also considered in $X_1$, $X_2$, and $X_3$), it takes into account
the occurrence pattern of the attribute pairs over the dual- 
language infoboxes (through the cosine distance). 

Table 7: MAP for different sources of correlation 
Figure 6: Top-K LSI results
Figure 7: Different COMA++ configurations

C. TUNING COMPETITOR SYSTEMS 

In Section 4, we presented only the best configuration of 
each approach. Here we show the results of other configu- 
rations we tested. 

LSI. Besides providing a source of attribute correlation in 
WikiMatch, LSI was also one of our baselines. In our ex- 
periments, we considered the top-k scoring matches as the 
alignments identified by LSI and computed the evaluation 
metrics. Figure 6 shows how LSI behaves for $k \in \{1, 3, 5, 10\}$.
As expected, recall increases with $k$, while precision
decreases.
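
The trade-off can be reproduced on a toy example. The sketch below assumes that the top-k matches are taken per source attribute, which is an assumption of this illustration; the function names and scores are hypothetical as well.

```python
def top_k_alignments(scores, k):
    """For each source attribute, keep its k highest-scoring target attributes.

    scores: dict mapping (source_attr, target_attr) -> similarity score.
    """
    by_source = {}
    for (src, tgt), score in scores.items():
        by_source.setdefault(src, []).append((score, tgt))
    selected = set()
    for src, candidates in by_source.items():
        candidates.sort(reverse=True)  # highest score first
        selected.update((src, tgt) for _, tgt in candidates[:k])
    return selected


def precision_recall(selected, truth):
    """Unweighted precision and recall of a set of alignments."""
    correct = len(selected & truth)
    precision = correct / len(selected) if selected else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical LSI scores and ground truth for three attributes.
scores = {("elenco", "starring"): 0.9, ("elenco", "director"): 0.4,
          ("direção", "directed by"): 0.7, ("direção", "writer"): 0.6,
          ("duração", "country"): 0.5, ("duração", "runtime"): 0.3}
truth = {("elenco", "starring"), ("direção", "directed by"),
         ("duração", "runtime")}

for k in (1, 2):
    print(k, precision_recall(top_k_alignments(scores, k), truth))
```

Raising k from 1 to 2 recovers the correct match that was ranked second, so recall rises, but it also admits more incorrect pairs, so precision falls.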

COMA++. Figure 7 shows the results for different configurations
of COMA++: name matcher (N), instance matcher
(I), name+instance matcher (NI), using Google Translator
for attribute names (N+G), and our automatically constructed
dictionary for instances (I+D) and attribute names
(N+D). For each configuration, we used Multiple(0,0,0) to
select candidate matches as it yielded the highest F-measure.
We also tried thresholds (A) up to 1 in increments of
0.1. We chose the configuration NG+ID for Pt-En and ID
for Vn-En, both with A = 0.01, since these led to the highest
F-measure. NG+ID had the best results in Pt-En because it
combines information from more sources (names, instances,
and translation). Note that the I configuration performs
almost as well as the best configurations that use translation.
While translation helps in some cases, in other cases
an incorrect translation does more harm than good. For
Vn-En, the translation of attribute names was not helpful. For
instance, diễn viên was translated to actor instead of starring,
and kinh phí was translated to funding instead of budget.
When the N and I matchers are combined, the N matcher
returns higher similarity scores and thus takes precedence over
the more reliable but lower scores of the I matcher. Therefore,
NG+ID has worse results than ID alone for Vn-En. We
note that even with a similarity threshold as low as 0.01,
the highest recall for the best configuration of COMA++ is
0.58 for Pt-En and 0.54 for Vn-En, while for WikiMatch, at
low thresholds, we obtain recall around 0.75.